GradQuant: Introduction to Programming in R

Introduction

Installing Packages

# install.packages("pacman")

Can you install tidyverse?

Installing Packages

# install.packages("tidyverse")

Loading Packages

library(pacman)

Can you load tidyverse?

Loading Packages

library(tidyverse)

Loading Packages

Pacman allows you to load multiple packages at once

library(pacman)
#install.packages("Rmisc")
p_load(tidyverse,
       Rmisc
)

Pacman also allows you to unload multiple packages at once

p_unload(tidyverse,
         Rmisc
         )

Asking for Help

?cor

Can you ask for help on linear regression? (Hint: the linear regression function is lm())

?lm

Working Directory

What is the working directory?

getwd()
[1] "/Users/jacobelder/Documents/GitHub/GradQuant-Workshops/IntroR_W2023"

Working Directory

How to change the working directory?

# setwd("~")
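A minimal sketch of changing the working directory and then restoring it. It uses R's session temporary directory (tempdir()) as a stand-in target so that no folder on your machine is assumed; the variable names are introduced here for illustration.

```r
# Remember where we started so we can come back
old_wd <- getwd()

# Move into a directory that is guaranteed to exist (the session temp dir)
setwd(tempdir())
in_temp <- normalizePath(getwd()) == normalizePath(tempdir())

# Restore the original working directory
setwd(old_wd)
back_home <- identical(getwd(), old_wd)
```

Saving the old directory first is a common habit, since setwd() changes where every later relative file path points.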

Operations

R as a Basic Calculator

Print the number 2

2
[1] 2

Add the number 2 and the number 1 together

2 + 1
[1] 3

Subtract 2 from 6

6 - 2
[1] 4

Multiply 2 by 3

2 * 3
[1] 6

R as a Basic Calculator

Square 2

2^2
[1] 4

Find the remainder of 7 after division by 2

7%%2
[1] 1

Square root of 4

sqrt(4)
[1] 2
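The remainder operator %% has a companion, %/%, for integer division. A short sketch showing how the two fit together (quotient and remainder are names chosen here for illustration):

```r
# Integer (floor) division: how many whole times does 2 go into 7?
quotient <- 7 %/% 2    # 3

# Remainder after that division
remainder <- 7 %% 2    # 1

# Quotient and remainder together reconstruct the original number
2 * quotient + remainder   # 7
```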

R as a Basic Calculator

Assignment: Derive the area for a circle if the radius is 37.5. (Hint: pi is stored in R as “pi”)

Answer

pi * 37.5^2
[1] 4417.865

R as a Basic Calculator

Assign the number 2 to a variable titled “two” (Hint: Use “<-” or “=” for assignment)

Answer

two <- 2

Use auto-printing

two
[1] 2

Use explicit printing

print(two)
[1] 2

Generate a sequence of numbers from 1 to 10 and assign to “domain”

domain <- 1:10

Add sequence of numbers to variable two and assign to variable “range”

Answer

range <- two + domain
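R applies arithmetic element by element, so the single value in two is recycled across every element of domain. A small sketch of what that addition produces; shifted is a name introduced here for illustration, chosen because range is also the name of a base R function.

```r
two <- 2
domain <- 1:10

# The length-1 vector `two` is recycled to match the length of `domain`
shifted <- two + domain
shifted
# [1]  3  4  5  6  7  8  9 10 11 12

# Base R's range() returns the minimum and maximum of a vector
range(shifted)
# [1]  3 12
```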

Create a vector of the numbers 3, 6, 9 and assign it to the variable “dots”. Note: c() stands for concatenate. A vector can only hold elements of a single class.

dots <- c(3, 6, 9)

Plot dots

Note: This is base R plotting which is generally less favored than ggplot2. It can be useful for quick plotting, like here.

plot(dots)

Try to create a vector with a string at the end.

Answer

broken <- c(3, 6, 9, "broke")

What do you notice when you add a value of a different class to the end of the vector?

broken
[1] "3"     "6"     "9"     "broke"
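Because a vector holds only one class, R silently converts every element to the most flexible class present. A brief sketch of that behaviour with a few mixed vectors (the variable names are illustrative only):

```r
# Logical + numeric -> numeric (TRUE becomes 1)
mixed_num <- c(TRUE, 2)
class(mixed_num)
# [1] "numeric"

# Numeric + character -> character
mixed_chr <- c(3, "broke")
class(mixed_chr)
# [1] "character"

# Logical + character -> character
mixed_lgl <- c(TRUE, "broke")
class(mixed_lgl)
# [1] "character"
```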

Assignment: Generate a slope-intercept function with a y-intercept (b) of 2, a slope of 4, and a domain set of numbers containing 2, 7, -1, 5, 3, 2, 11, 3. Write the function as y = mx + b by storing each number as a variable.

Answer

m = 4
b = 2
x = c(2, 7, -1, 5, 3, 2, 11, 3)

y = m*x + b

Assignment: Print the range assigned as “y”.

Answer

print(y)
[1] 10 30 -2 22 14 10 46 14

Assignment: Plot “y”

Answer

plot(y)

Classes

Examine the class

class(y)
[1] "numeric"

What is its class?

There are a variety of atomic (i.e., fundamental) classes…

Character

class("apple")
[1] "character"
class("p")
[1] "character"
class("1")
[1] "character"

Numeric (real numbers)

class(1.23343)
[1] "numeric"
class(pi)
[1] "numeric"
class(2.0)
[1] "numeric"

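Integer

Note: a whole number typed without the L suffix is stored as numeric (double); append L to create an integer.

class(1)
[1] "numeric"
class(2L)
[1] "integer"
class(two)
[1] "numeric"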

Logical (True/False)

class(TRUE)
[1] "logical"
class(FALSE)
[1] "logical"
class(F)
[1] "logical"
class(T)
[1] "logical"

Missing Values

NA
[1] NA
NaN
[1] NaN
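Missing values propagate through most calculations, so it helps to know how to detect and skip them. A minimal sketch using a made-up vector called scores:

```r
scores <- c(4, NA, 10)

# is.na() flags which elements are missing
is.na(scores)
# [1] FALSE  TRUE FALSE

# Most summaries return NA if any value is missing...
mean(scores)
# [1] NA

# ...unless you ask them to drop missing values first
mean(scores, na.rm = TRUE)
# [1] 7
```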

Categorical Factors

Factor is not an atomic class, but variables can be converted to categorical factors for data analysis. Note: If a character variable is used in a statistical model, it is treated as a factor by default.

fruit <- c("apple", "orange", "banana")
fruit
[1] "apple"  "orange" "banana"
fruit <- as.factor(fruit)
fruit
[1] apple  orange banana
Levels: apple banana orange
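The printed Levels line hints at how factors work: each value is stored as an integer code that points to a level. A short sketch, assuming the same three fruits (fruit_ordered is a name introduced here):

```r
fruit <- factor(c("apple", "orange", "banana"))

# Levels are sorted alphabetically by default
levels(fruit)
# [1] "apple"  "banana" "orange"

# Under the hood each value is stored as an integer code for its level
as.integer(fruit)
# [1] 1 3 2

# You can set the level order yourself
fruit_ordered <- factor(c("apple", "orange", "banana"),
                        levels = c("orange", "banana", "apple"))
levels(fruit_ordered)
# [1] "orange" "banana" "apple"
```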

Class Coercion

You can explicitly coerce your objects to become other classes. For example, 1’s and 0’s can be converted to logical TRUEs and FALSEs

binary <- c(1,0,0,1)
binary
[1] 1 0 0 1
binary <- as.logical(binary)
binary
[1]  TRUE FALSE FALSE  TRUE

Numbers can be converted to characters

numbers <- c(1,2,3)
numbers
[1] 1 2 3
numbers <- as.character(numbers)
numbers
[1] "1" "2" "3"
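Coercion does not always succeed. As a sketch of the failure case, converting text that is not a number yields NA (the vector mixed is made up for this example):

```r
mixed <- c("1", "2", "three")

# "three" cannot be read as a number, so it becomes NA (R also emits a
# warning, silenced here with suppressWarnings())
converted <- suppressWarnings(as.numeric(mixed))
converted
# [1]  1  2 NA
```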

Matrices

Matrices are vectors with an added dimension: n rows by m columns. Like vectors, they must hold a single class; otherwise their elements will be coerced.

You can create an empty matrix of 3 rows and 2 columns.

m <- matrix(nrow = 3, ncol = 2)
m
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA
[3,]   NA   NA

Matrices

You can create a populated matrix of 3 rows and 3 columns.

m <- matrix(1:9, nrow=3, ncol=3)
m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
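Note that matrix() fills column by column by default. A brief sketch of the row-wise alternative and of indexing a single cell (m_by_row is a name introduced here):

```r
# byrow = TRUE fills the matrix row by row instead of column by column
m_by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
m_by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9

# Index a single cell with [row, column]
m_by_row[2, 3]
# [1] 6

# t() transposes, which here recovers the column-filled matrix
identical(t(m_by_row), matrix(1:9, nrow = 3, ncol = 3))
# [1] TRUE
```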

Matrices

You can bind together vectors to create a matrix. For example, using cbind() to bind columns.

numbers1 <- c(1,2,3)
numbers2 <- c(4,5,6)
cnums <- cbind(numbers1,numbers2)
cnums
     numbers1 numbers2
[1,]        1        4
[2,]        2        5
[3,]        3        6

Matrices

For example, using rbind() to bind rows.

rnums <- rbind(numbers1, numbers2)
rnums
         [,1] [,2] [,3]
numbers1    1    2    3
numbers2    4    5    6

For special cases, you can go beyond a 2-dimensional object using arrays. See a 2x5x4 array of NAs/missing values below:

dim(array(NA, c(2,5,4)))
[1] 2 5 4

Lists

Lists are a special type of object that can contain elements of different classes. Below is a list:

fruitandnums <- list(1, "apple", 3)
fruitandnums
[[1]]
[1] 1

[[2]]
[1] "apple"

[[3]]
[1] 3

Lists

Lists can also contain objects of different types and sizes. Here is a list containing a matrix, a list, and a vector.

hodgepodge <- list(m, fruitandnums, numbers1)
hodgepodge
[[1]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[[2]]
[[2]][[1]]
[1] 1

[[2]][[2]]
[1] "apple"

[[2]][[3]]
[1] 3


[[3]]
[1] 1 2 3

Lists

Lists can also be named

hodgepodge <- list(my_mat = m, my_fruit = fruitandnums, my_number = numbers1)
hodgepodge
$my_mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

$my_fruit
$my_fruit[[1]]
[1] 1

$my_fruit[[2]]
[1] "apple"

$my_fruit[[3]]
[1] 3


$my_number
[1] 1 2 3
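Naming list elements pays off when you need to pull them back out. A minimal sketch of the three common ways to index a list, using a small made-up list called pantry:

```r
numbers1 <- c(1, 2, 3)
pantry <- list(my_fruit = "apple", my_number = numbers1)

# $ and [[ ]] both pull out the element itself
pantry$my_number
# [1] 1 2 3
pantry[["my_fruit"]]
# [1] "apple"

# Single brackets return a smaller list, not the element
class(pantry["my_fruit"])
# [1] "list"
```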

Operations

Arithmetic operations were covered above: +, -, /, *, ^, sqrt(), %%, etc.

But there are other operations as well.

Relational Operations

1 < 2
[1] TRUE
1 <= 1
[1] TRUE
1 == 1
[1] TRUE
T == T
[1] TRUE
T > F
[1] TRUE
"a" == "a"
[1] TRUE
"a" != "b"
[1] TRUE

Logical Operations: AND

& is AND

Returns TRUE when both conditions are TRUE

T & T
[1] TRUE
T & F
[1] FALSE
c(T,T) & c(T,F)
[1]  TRUE FALSE
(.5 > 0) & (1 > .5)
[1] TRUE
(1 > 2) & (1 > .5)
[1] FALSE

Logical Operations: OR

| is OR

Returns TRUE when at least one of the conditions is TRUE

T | T
[1] TRUE
T | F
[1] TRUE
F | F
[1] FALSE
c(T,T,F) | c(T,F,F)
[1]  TRUE  TRUE FALSE
(.5 > 0) | (1 > .5)
[1] TRUE
(1 > 2) | (1 > .5)
[1] TRUE

Logical Operations: Negation

! is negation

!(T | F)
[1] FALSE
!(100 > 0)
[1] FALSE

Logical Operations: which()

which() is useful with logical vectors: it returns the indices of the elements that are TRUE

which(c(T,F,T,F,T,F,T))
[1] 1 3 5 7
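In practice the logical vector usually comes from a comparison rather than being typed out. A short sketch with a made-up vector called weights, alongside the related helpers any(), all(), and which.max():

```r
weights <- c(2.6, 3.4, 1.8, 5.2)

# Positions of the elements greater than 3
which(weights > 3)
# [1] 2 4

# any() and all() summarize a logical vector into a single TRUE/FALSE
any(weights > 5)
# [1] TRUE
all(weights > 2)
# [1] FALSE

# which.max() gives the position of the largest value
which.max(weights)
# [1] 4
```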

Reading and Writing Data

Load and inspect base R’s “mtcars” dataset

Load mtcars dataset

data("mtcars")

Inspect it

#View(mtcars)

Inspect the column names

colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Inspect mtcars

Examine dimensions of the data

dim(mtcars)
[1] 32 11

Examine the head (top 6 rows) and tail (bottom 6 rows) of mtcars

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

What is in the bottom six and top six rows?

Inspect dataset

What kind of class is mtcars?

Examine class of dataset

class(mtcars)
[1] "data.frame"

You can also inspect and call variable names using “$”

mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
mtcars$cyl
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars$disp
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

What kind of class is mpg?

Check class of mpg

class(mtcars$mpg)
[1] "numeric"

Indexing

Index first row

mtcars[1,]
          mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Index second row

mtcars[2,]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Index 3rd through 5th rows

mtcars[3:5,]
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Index 4th column

mtcars[,4]
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
[20]  65  97 150 150 245 175  66  91 113 264 175 335 109

You can also index without the comma, but the result is a data.frame rather than a vector as above.

mtcars[4]
                     hp
Mazda RX4           110
Mazda RX4 Wag       110
Datsun 710           93
Hornet 4 Drive      110
Hornet Sportabout   175
Valiant             105
Duster 360          245
Merc 240D            62
Merc 230             95
Merc 280            123
Merc 280C           123
Merc 450SE          180
Merc 450SL          180
Merc 450SLC         180
Cadillac Fleetwood  205
Lincoln Continental 215
Chrysler Imperial   230
Fiat 128             66
Honda Civic          52
Toyota Corolla       65
Toyota Corona        97
Dodge Challenger    150
AMC Javelin         150
Camaro Z28          245
Pontiac Firebird    175
Fiat X1-9            66
Porsche 914-2        91
Lotus Europa        113
Ford Pantera L      264
Ferrari Dino        175
Maserati Bora       335
Volvo 142E          109

Index the 2nd and 5th columns

mtcars[,c(2,5)]
                    cyl drat
Mazda RX4             6 3.90
Mazda RX4 Wag         6 3.90
Datsun 710            4 3.85
Hornet 4 Drive        6 3.08
Hornet Sportabout     8 3.15
Valiant               6 2.76
Duster 360            8 3.21
Merc 240D             4 3.69
Merc 230              4 3.92
Merc 280              6 3.92
Merc 280C             6 3.92
Merc 450SE            8 3.07
Merc 450SL            8 3.07
Merc 450SLC           8 3.07
Cadillac Fleetwood    8 2.93
Lincoln Continental   8 3.00
Chrysler Imperial     8 3.23
Fiat 128              4 4.08
Honda Civic           4 4.93
Toyota Corolla        4 4.22
Toyota Corona         4 3.70
Dodge Challenger      8 2.76
AMC Javelin           8 3.15
Camaro Z28            8 3.73
Pontiac Firebird      8 3.08
Fiat X1-9             4 4.08
Porsche 914-2         4 4.43
Lotus Europa          4 3.77
Ford Pantera L        8 4.22
Ferrari Dino          6 3.62
Maserati Bora         8 3.54
Volvo 142E            4 4.11

Index column for “cyl”

mtcars[,"cyl"]
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

Index column for “cyl” and “wt”

mtcars[,c("cyl","wt")]
                    cyl    wt
Mazda RX4             6 2.620
Mazda RX4 Wag         6 2.875
Datsun 710            4 2.320
Hornet 4 Drive        6 3.215
Hornet Sportabout     8 3.440
Valiant               6 3.460
Duster 360            8 3.570
Merc 240D             4 3.190
Merc 230              4 3.150
Merc 280              6 3.440
Merc 280C             6 3.440
Merc 450SE            8 4.070
Merc 450SL            8 3.730
Merc 450SLC           8 3.780
Cadillac Fleetwood    8 5.250
Lincoln Continental   8 5.424
Chrysler Imperial     8 5.345
Fiat 128              4 2.200
Honda Civic           4 1.615
Toyota Corolla        4 1.835
Toyota Corona         4 2.465
Dodge Challenger      8 3.520
AMC Javelin           8 3.435
Camaro Z28            8 3.840
Pontiac Firebird      8 3.845
Fiat X1-9             4 1.935
Porsche 914-2         4 2.140
Lotus Europa          4 1.513
Ford Pantera L        8 3.170
Ferrari Dino          6 2.770
Maserati Bora         8 3.570
Volvo 142E            4 2.780

Index column for “cyl” and “wt” and move “wt” before “cyl”

mtcars[,c("wt","cyl")]
                       wt cyl
Mazda RX4           2.620   6
Mazda RX4 Wag       2.875   6
Datsun 710          2.320   4
Hornet 4 Drive      3.215   6
Hornet Sportabout   3.440   8
Valiant             3.460   6
Duster 360          3.570   8
Merc 240D           3.190   4
Merc 230            3.150   4
Merc 280            3.440   6
Merc 280C           3.440   6
Merc 450SE          4.070   8
Merc 450SL          3.730   8
Merc 450SLC         3.780   8
Cadillac Fleetwood  5.250   8
Lincoln Continental 5.424   8
Chrysler Imperial   5.345   8
Fiat 128            2.200   4
Honda Civic         1.615   4
Toyota Corolla      1.835   4
Toyota Corona       2.465   4
Dodge Challenger    3.520   8
AMC Javelin         3.435   8
Camaro Z28          3.840   8
Pontiac Firebird    3.845   8
Fiat X1-9           1.935   4
Porsche 914-2       2.140   4
Lotus Europa        1.513   4
Ford Pantera L      3.170   8
Ferrari Dino        2.770   6
Maserati Bora       3.570   8
Volvo 142E          2.780   4

Index 3rd row and 5th column

mtcars[3,5]
[1] 3.85

Index 2nd, 3rd, and 5th column and 3rd through 6th rows

mtcars[3:6, c(2,3,5)]
                  cyl disp drat
Datsun 710          4  108 3.85
Hornet 4 Drive      6  258 3.08
Hornet Sportabout   8  360 3.15
Valiant             6  225 2.76

Subset rows where wt is less than 3

mtcars[mtcars$wt < 3,]
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Subset rows where vs is 1

mtcars[mtcars$vs == 1,]
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Subset rows where carb is NOT 4

mtcars[mtcars$carb != 4,]
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC       15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic       30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona     21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin       15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9         27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora     15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E        21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
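Row conditions can be combined with the logical operators from earlier, and a column selection can be applied in the same step. A sketch under the assumption that we want light-on-fuel four-cylinder cars; efficient and efficient2 are names introduced here.

```r
data("mtcars")

# Rows with 4 cylinders AND more than 30 miles per gallon,
# keeping only the mpg, cyl, and wt columns
efficient <- mtcars[mtcars$cyl == 4 & mtcars$mpg > 30, c("mpg", "cyl", "wt")]
nrow(efficient)
# [1] 4

# subset() expresses the same filter without repeating `mtcars$`
efficient2 <- subset(mtcars, cyl == 4 & mpg > 30, select = c(mpg, cyl, wt))
all.equal(efficient, efficient2)
# [1] TRUE
```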

Writing

  • There are many file types you could use. Probably the most common is CSV (comma separated values), which represents your data as text delimited by commas.
  • While you can open CSV files in Excel, there are some important distinctions.
  • CSV is a plain text format with a series of values separated by commas, whereas an Excel file is a binary file that holds information about all the worksheets in a workbook.
  • A CSV file only stores data, whereas an Excel file can also store operations on the data (e.g., formulas).
  • CSV files are faster to import and consume less memory than XLSX files.

Writing

While less commonly used, you can also read and write text data as tab delimited, semicolon delimited, space delimited, etc.

There are also proprietary data types for software such as SPSS (.sav), STATA (.dta), SAS (.sas7bdat)

Write csv to your working directory (comma delimited)

#install.packages("here")
library(here)
i_am("./IntroR_W2023/GQ_IntroR.qmd")
write.csv(mtcars, "mtcars.csv", row.names = F)

You can also write a semicolon-delimited csv to your working directory

write.table(mtcars, file="mtcarsSC.csv",quote=TRUE, sep = ";")

Can also write tsv to your working directory

write.table(mtcars, file = "mtcars.tsv", row.names=FALSE, sep="\t")

To write and import xlsx files, one option is the xlsx package

#install.packages('xlsx')     
library(xlsx)
write.xlsx(mtcars, file = "mtcars.xlsx", row.names = T)

The “haven” package will allow you to read and write SPSS files such as .sav, SAS files, and STATA files

#install.packages("haven")
library(haven)
haven::write_sav(mtcars, "mtcars.sav")

You can save your environment/workspace as well

save.image("mtcars_space.RData")

Clear workspace

rm(list = ls())

Load workspace

load("mtcars_space.RData")

Reading Data

Base read csv

df <- read.csv("mtcars.csv",header = T)

readr’s read_csv() is much faster than base read.csv()

df <- readr::read_csv("mtcars.csv")

data.table’s fread() is faster still (2.5x faster than read_csv). Beyond being fast and efficient, it automatically detects the number of columns, the number of rows, and the delimiter, so it will determine whether your input is tab separated or comma separated, for example.

df <- data.table::fread("mtcars.csv")

We can also read the mtcars dataset back in from the tsv file or from the semicolon-separated file

mtcars <- read.csv("mtcarsSC.csv",header = T, sep = ";")

Read in the tsv file with readr

mtcars <- readr::read_tsv("mtcars.tsv")

Read in the tsv file with base

mtcars <- read.table("mtcars.tsv", sep = "\t", header = T)
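A quick way to check that writing and reading agree is a round trip. The sketch below writes to a temporary file (via tempfile(), so the working directory is left untouched) and reads the file back with base R; tmp_csv and roundtrip are names introduced here.

```r
data("mtcars")

# Write to a temporary file so the working directory is left untouched
tmp_csv <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp_csv, row.names = FALSE)

# Read it back in
roundtrip <- read.csv(tmp_csv, header = TRUE)

# Same shape and column names as the original
dim(roundtrip)
# [1] 32 11
identical(colnames(roundtrip), colnames(mtcars))
# [1] TRUE

# The car names were row names, so row.names = FALSE dropped them
head(rownames(roundtrip), 3)
# [1] "1" "2" "3"
```

If the car names matter, one option is to write with row.names = TRUE or to copy them into a regular column before writing.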

Importing a sav file from SPSS

# read_sav("mtcars.sav")

Reading in an xlsx file

dfxlsx <- read.xlsx("mtcars.xlsx", header = T, sheetIndex = 1, row.names = 1)

Functions for Reading Files

  • read.table, read.csv, for reading tabular data
  • readLines, for reading lines of a text file
  • source, for reading in R code files (inverse of dump)
  • dget, for reading in R code files (inverse of dput)
  • load, for reading in saved workspaces
  • unserialize, for reading single R objects in binary form

Functions for Writing Files

  • write.table, for writing tabular data to text files (e.g., CSV) or connections
  • writeLines, for writing character data line-by-line to a file or connection
  • dump, for dumping a textual representation of multiple R objects
  • dput, for outputting a textual representation of an R object
  • save, for saving an arbitrary number of R objects in binary format (possibly compressed) to a file.
  • serialize, for converting an R object into a binary format for outputting to a connection (or file).

Inspect

How many rows does this dataset have?

nrow(df)
[1] 32

How many columns does this dataset have?

ncol(df)
[1] 11

What are the dimensions for this dataset? Use the function for this.

dim(df)
[1] 32 11

What are the variable names for the dataset?

colnames(df)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Control Flow

We want to convert Fahrenheit to Celsius. The formula is C = (F-32)*(5/9).

Fahrenheit to Celsius

What is 72F in C?

(72-32)*(5/9)
[1] 22.22222

What is 50F in C?

(50-32)*(5/9)
[1] 10

What is 32F in C?

(32-32)*(5/9)
[1] 0

What is 102F in C?

(102-32)*(5/9)
[1] 38.88889

What is 20F in C?

(20-32)*(5/9)
[1] -6.666667

We have to re-write or copy and paste that code each time. Is there a more efficient way?

Custom F-to-C Function

FtoC <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

Breaking Down Custom Function Logic

functionName <- function(arg1, arg2){
  # processes: any computation that uses the arguments
  result <- arg1 + arg2
  return(result)
}

Returning to the Redundant Code

What is 72F in C?

FtoC(72)
[1] 22.22222

What is 50F in C?

FtoC(50)
[1] 10

What is 32F in C?

FtoC(32)
[1] 0

What is 102F in C?

FtoC(102)
[1] 38.88889

What is 20F in C?

FtoC(20)
[1] -6.666667

Ultimately, you see that less code is needed here.
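The function can shrink the code even further, because R arithmetic is vectorized. A sketch that converts all five temperatures in one call and adds a hypothetical inverse helper, CtoF(), which is not part of the workshop code:

```r
FtoC <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

# One call converts a whole vector of temperatures
celsius <- FtoC(c(72, 50, 32, 102, 20))
round(celsius, 2)
# [1] 22.22 10.00  0.00 38.89 -6.67

# Hypothetical inverse helper (an assumption for this example)
CtoF <- function(temp_C) {
  temp_C * 9 / 5 + 32
}

# Converting there and back should recover the input
all.equal(CtoF(FtoC(72)), 72)
# [1] TRUE
```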

Order of Arguments

Exercise: Concatenating two strings with paste0 works like this: paste0(‘a’,‘b’) produces ‘ab’.

Write a function called ‘fence’ that takes two parameters called input and wrapper and returns a new string that has the wrapper character at the beginning and end of the input. A call to your function as fence(‘word’,‘*’) should produce ‘*word*’

fence <- function(input, wrapper){
  paste0(wrapper, input, wrapper)
}

Wrap “+” around “pizza”

fence("pizza","+")
[1] "+pizza+"

What happens if you switch the order though?

fence("+","pizza")
[1] "pizza+pizza"

Order of inputs matters! If you supply the inputs in the same order as the function’s arguments, you do not need to name them.

But if you supply the inputs in a different order from the one in the function definition, you must name the arguments explicitly. For example…

fence(wrapper="+",input="pizza")
[1] "+pizza+"

Default Arguments

We can also add defaults for the functions. For example, maybe if someone doesn’t specify a wrapper, we assume they just want to print the word without a wrapper.

Do you have an intuition for how we might implement that?

Ok, here’s what it would look like:

fence <- function(input, wrapper=""){
  paste0(wrapper, input, wrapper)
}

Try it out on “pizza” as the input word without a wrapper using the variant of the function with an empty default for the wrapper

fence("pizza")
[1] "pizza"

You may have noticed that the temperature-conversion function used return(), while fence() did not. return() is not strictly required: without it, a function returns the value of its last evaluated expression. Use return() when you want to be explicit, or when you want to return something other than the last expression.
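One case where return() earns its place is leaving a function early. A minimal sketch with a hypothetical helper, safe_sqrt(), that mixes an explicit early return with an implicit final one:

```r
# Hypothetical helper: gives NA for negative input instead of a warning
safe_sqrt <- function(x) {
  if (x < 0) {
    return(NA_real_)  # leave the function early
  }
  sqrt(x)             # last evaluated expression is returned implicitly
}

safe_sqrt(16)
# [1] 4
safe_sqrt(-1)
# [1] NA
```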

Weighted Mean

We want to estimate the average miles per gallon across all cars

mean(mtcars$mpg)
[1] 20.09062

Now, we want to compute the mean miles per gallon of the car weighted by multiple different variables

Weighted by weight

sum( mtcars$mpg * mtcars$wt ) / sum(mtcars$wt)
[1] 18.54993

Weighted by 1/4 mile time

sum( mtcars$mpg * mtcars$qsec ) / sum(mtcars$qsec)
[1] 20.33536

Weighted by gross horsepower

sum( mtcars$mpg * mtcars$hp ) / sum(mtcars$hp)
[1] 17.97245

Weighted by number of carburetors

sum( mtcars$mpg * mtcars$carb ) / sum(mtcars$carb)
[1] 18.24333

However, doing this requires more code than is necessary.

Can you create a weighted mean custom function?

\(W = \frac{\sum_{i=1}^{n} w_{i} X_{i}}{\sum_{i=1}^{n} w_{i}}\)

Answer

wMean <- function(average, weighted){
  output <- sum(average * weighted) / sum(weighted)
  return(output)
}

Try out mpg weighted by wt, qsec, hp, carb

attach(mtcars)
wMean(mpg, wt)
[1] 18.54993
wMean(mpg, qsec)
[1] 20.33536
wMean(mpg, hp)
[1] 17.97245
wMean(mpg, carb)
[1] 18.24333
detach(mtcars)
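As a sanity check, base R already ships weighted.mean(x, w), which computes the same quantity. The sketch below redefines wMean() so that it is self-contained and compares the two:

```r
data("mtcars")

wMean <- function(average, weighted) {
  sum(average * weighted) / sum(weighted)
}

# Base R's weighted.mean(x, w) implements the same formula
all.equal(wMean(mtcars$mpg, mtcars$wt),
          weighted.mean(mtcars$mpg, mtcars$wt))
# [1] TRUE
```

Writing the function by hand is still a useful exercise, but for real analyses the built-in version also offers an na.rm argument.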

Hypotenuse Custom Function

Can you create a function for calculating the hypotenuse of a right triangle?

\(a^2 + b^2 = h^2\)

\(\sqrt{a^2 + b^2} = h\)

hyp <- function(side_a, side_b) {
  a <- side_a^2
  b <- side_b^2
  h <- sqrt(a + b)
  return(h)
}

Test it out!

hyp(3, 4)
[1] 5
hyp(9, 12)
[1] 15
hyp(3,5)
[1] 5.830952

Local Variables

You’ll see that h is defined within the function. Let’s try and print it outside the function.

try( print(h) )
Error in print(h) : object 'h' not found

Woah, there’s an error. Why is that?

Variables defined inside a function are called “local variables”: they exist only in the function’s local environment and are not available outside it in the global environment.
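The same rule means a function cannot accidentally overwrite a global variable just by assigning to the same name. A small sketch with made-up names label and relabel():

```r
label <- "global"

relabel <- function() {
  label <- "local"   # creates a new variable inside the function only
  label
}

relabel()
# [1] "local"

# The global variable is unchanged
label
# [1] "global"
```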

Null Arguments

Perhaps we want a function that falls back on a default value when an argument is left empty. How would we accomplish that? Let’s say that, for the hypotenuse function, we want to use side a’s value if side b is not supplied.

hyp <- function(side_a, side_b = NULL) {
  a <- side_a^2
  b <- if (is.null(side_b)) a else side_b^2
  h <- sqrt(a + b)
  return(h)
}

These should be identical

hyp(9, 9)
[1] 12.72792
hyp(9)
[1] 12.72792

Control program flow with `if` and `else` statements

x <- 1 # input of 1
if (x == 0) {
  paste(x, "is exactly 0")
} else if (x < 0) {
  paste(x, "is negative")
} else {
  paste(x, "is positive")
} 
[1] "1 is positive"
x <- -1 # input of -1
if (x == 0) {
  paste(x, "is exactly 0")
} else if (x < 0) {
  paste(x, "is negative")
} else {
  paste(x, "is positive")
} 
[1] "-1 is negative"

Let’s combine it with what we’ve learned about custom functions to make it less code:

posOneg <- function(x){
  if (x == 0) {
  paste(x, "is exactly 0")
  } else if (x < 0) {
    paste(x, "is negative")
  } else {
    paste(x, "is positive")
  } 
}
posOneg(1)
[1] "1 is positive"
posOneg(-1)
[1] "-1 is negative"
posOneg(0)
[1] "0 is exactly 0"
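Keep in mind that if() expects a single TRUE or FALSE, so posOneg() is meant for one number at a time. For a whole vector, the vectorized ifelse(test, yes, no) is one alternative. A sketch with a made-up vector called temps:

```r
temps <- c(-1, 0, 1)

# ifelse() works element by element, so it accepts a whole vector;
# nesting a second ifelse() handles the third outcome
sign_label <- ifelse(temps == 0, "exactly 0",
                     ifelse(temps < 0, "negative", "positive"))
paste(temps, "is", sign_label)
# [1] "-1 is negative" "0 is exactly 0" "1 is positive"
```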

For loops

for (i in c(1, 2, 3, 4, 5)) {
  # "i" will take on each value of the vector 1:5
  print(paste("This is loop iteration", i))
}
[1] "This is loop iteration 1"
[1] "This is loop iteration 2"
[1] "This is loop iteration 3"
[1] "This is loop iteration 4"
[1] "This is loop iteration 5"

Loop Over Days of the Week

days_of_the_week <- c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun")

for (i in days_of_the_week) {
  print(paste("Today is", i))
}
[1] "Today is Mon"
[1] "Today is Tues"
[1] "Today is Wed"
[1] "Today is Thur"
[1] "Today is Fri"
[1] "Today is Sat"
[1] "Today is Sun"

Can you loop over each row of mtcars and print its mpg value?

for(n in 1:nrow(mtcars)){
  print(mtcars$mpg[n])
}
[1] 21
[1] 21
[1] 22.8
[1] 21.4
[1] 18.7
[1] 18.1
[1] 14.3
[1] 24.4
[1] 22.8
[1] 19.2
[1] 17.8
[1] 16.4
[1] 17.3
[1] 15.2
[1] 10.4
[1] 10.4
[1] 14.7
[1] 32.4
[1] 30.4
[1] 33.9
[1] 21.5
[1] 15.5
[1] 15.2
[1] 13.3
[1] 19.2
[1] 27.3
[1] 26
[1] 30.4
[1] 15.8
[1] 19.7
[1] 15
[1] 21.4

Apply Functions

apply(X, MARGIN, FUN): a MARGIN of 1 applies the function over rows, 2 over columns, and c(1,2) over both rows and columns (i.e., every cell)

## Sum over rows

apply(mtcars, 1, sum)
 [1] 328.980 329.795 259.580 426.135 590.310 385.540 656.920 270.980 299.570
[10] 350.460 349.660 510.740 511.500 509.850 728.560 726.644 725.695 213.850
[19] 195.165 206.955 273.775 519.650 506.085 646.280 631.175 208.215 272.570
[28] 273.683 670.690 379.590 694.710 288.890
## Mean over columns

apply(mtcars, 2, mean)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

rowMeans, colMeans, rowSums, and colSums are faster, dedicated shortcuts for the equivalent apply() calls

rowMeans(mtcars)
 [1] 29.90727 29.98136 23.59818 38.73955 53.66455 35.04909 59.72000 24.63455
 [9] 27.23364 31.86000 31.78727 46.43091 46.50000 46.35000 66.23273 66.05855
[17] 65.97227 19.44091 17.74227 18.81409 24.88864 47.24091 46.00773 58.75273
[25] 57.37955 18.92864 24.77909 24.88027 60.97182 34.50818 63.15545 26.26273
colMeans(mtcars)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 
rowSums(mtcars)
 [1] 328.980 329.795 259.580 426.135 590.310 385.540 656.920 270.980 299.570
[10] 350.460 349.660 510.740 511.500 509.850 728.560 726.644 725.695 213.850
[19] 195.165 206.955 273.775 519.650 506.085 646.280 631.175 208.215 272.570
[28] 273.683 670.690 379.590 694.710 288.890
colSums(mtcars)
     mpg      cyl     disp       hp     drat       wt     qsec       vs 
 642.900  198.000 7383.100 4694.000  115.090  102.952  571.160   14.000 
      am     gear     carb 
  13.000  118.000   90.000 

There is a whole family of apply functions:

  • apply: apply(X, MARGIN, FUN). Applies a function to the rows, the columns, or both. Input: data frame or matrix. Output: vector, list, or array.

  • lapply: lapply(X, FUN). Applies a function to every element of the input. Input: list, vector, or data frame. Output: list.

  • sapply: sapply(X, FUN). Applies a function to every element of the input. Input: list, vector, or data frame. Output: vector or matrix.
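To make the difference between lapply() and sapply() concrete, here is a short sketch on mtcars. A data frame is a list of columns, so both functions visit one column at a time; col_means_list and col_means_vec are names introduced here.

```r
data("mtcars")

# lapply() returns one list element per column
col_means_list <- lapply(mtcars, mean)
class(col_means_list)
# [1] "list"

# sapply() does the same but simplifies the result to a named vector
col_means_vec <- sapply(mtcars, mean)
all.equal(col_means_vec, colMeans(mtcars))
# [1] TRUE
```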

Combining for Loop and Custom Functions!

Use a for loop to examine the weighted mean of mpg using cyl, disp, hp, drat, wt, and qsec as weights. Let’s also compare the weighted means to the overall mean.

preds <- c("cyl", "disp", "hp", "drat", "wt", "qsec")
wm_output <- numeric()
for (p in preds) {
  wm_output[p] <- wMean(mtcars[["mpg"]], mtcars[[p]])
}
wm_output
     cyl     disp       hp     drat       wt     qsec 
18.65455 17.43239 17.97245 20.68188 18.54993 20.33536 
wm_output - mean(mtcars[,'mpg'])
       cyl       disp         hp       drat         wt       qsec 
-1.4360795 -2.6582348 -2.1181708  0.5912501 -1.5406911  0.2447364 
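The same calculation can be written without an explicit loop by handing the work to sapply(). This is a sketch of one alternative, not a replacement for the loop; wMean() is redefined so the block is self-contained, and wm_sapply is a name introduced here.

```r
data("mtcars")

wMean <- function(average, weighted) {
  sum(average * weighted) / sum(weighted)
}

preds <- c("cyl", "disp", "hp", "drat", "wt", "qsec")

# sapply() calls the anonymous function once per predictor name and
# collects the results into a vector named after `preds`
wm_sapply <- sapply(preds, function(p) wMean(mtcars[["mpg"]], mtcars[[p]]))

# Difference between each weighted mean and the overall mean
wm_sapply - mean(mtcars[["mpg"]])
```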

While statement

A while loop keeps running as long as its condition is TRUE and stops once the condition becomes FALSE. Be careful with these: if your code is ‘broken’ and the condition can never become FALSE, the loop will run indefinitely.

filler <- 0
while(filler < 100000000){
  filler <- filler + 1
}
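One defensive pattern is to cap the number of iterations and leave the loop with break if the cap is reached. A sketch under the assumption that 1000 iterations is a sensible upper bound for this toy problem:

```r
value <- 1
iterations <- 0
max_iterations <- 1000   # safety cap (an assumption for this sketch)

# Keep doubling while value is below 100, but never loop more than the cap
while (value < 100) {
  value <- value * 2
  iterations <- iterations + 1
  if (iterations >= max_iterations) {
    break   # bail out instead of running forever
  }
}

value
# [1] 128
iterations
# [1] 7
```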

Merge Dataframes

Create origin dataframe

producers <- data.frame(   
    surname =  c("Spielberg","Scorsese","Hitchcock","Tarantino","Polanski"),    
    nationality = c("US","US","UK","US","Poland"),    
    stringsAsFactors=FALSE)

Create destination dataframe

movies <- data.frame(
    surname = c("Spielberg",
                "Scorsese",
                "Hitchcock",
                "Hitchcock",
                "Spielberg",
                "Tarantino",
                "Polanski"),
    title = c("Super 8",
              "Taxi Driver",
              "Psycho",
              "North by Northwest",
              "Catch Me If You Can",
              "Reservoir Dogs",
              "Chinatown"),
    stringsAsFactors = FALSE)

Inspect Structure of Dataframe

str(movies)
'data.frame':   7 obs. of  2 variables:
 $ surname: chr  "Spielberg" "Scorsese" "Hitchcock" "Hitchcock" ...
 $ title  : chr  "Super 8" "Taxi Driver" "Psycho" "North by Northwest" ...
str(producers)
'data.frame':   5 obs. of  2 variables:
 $ surname    : chr  "Spielberg" "Scorsese" "Hitchcock" "Tarantino" ...
 $ nationality: chr  "US" "US" "UK" "US" ...

Merge two datasets

m1 <- merge(producers, movies, by = "surname")
m1
    surname nationality               title
1 Hitchcock          UK              Psycho
2 Hitchcock          UK  North by Northwest
3  Polanski      Poland           Chinatown
4  Scorsese          US         Taxi Driver
5 Spielberg          US             Super 8
6 Spielberg          US Catch Me If You Can
7 Tarantino          US      Reservoir Dogs

You can also merge on key columns that have different names in the first and second dataframe by using by.x and by.y.
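As a small sketch of by.x and by.y, suppose the key column is called last_name in one data frame and surname in the other. Both data frames below are hypothetical and exist only for this illustration.

```r
# Hypothetical data frames whose key columns have different names
producers_by <- data.frame(
  last_name   = c("Spielberg", "Hitchcock"),
  nationality = c("US", "UK"),
  stringsAsFactors = FALSE
)
movies_by <- data.frame(
  surname = c("Spielberg", "Hitchcock", "Hitchcock"),
  title   = c("Super 8", "Psycho", "North by Northwest"),
  stringsAsFactors = FALSE
)

# by.x names the key in the first data frame, by.y the key in the second
m_by <- merge(producers_by, movies_by, by.x = "last_name", by.y = "surname")
m_by
```

The merged key column keeps the name from the first data frame, here last_name.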

Inner Join

An inner join (actually a natural join) is the most common way to join two data sets. It merges two dataframes into one that contains only the rows whose key values appear in both.

m1 <- merge(producers, movies, by = "surname")

Outer (Full) Join

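The outer join, also known as a full outer join or full join, keeps all rows from both data sets and fills in NA wherever one side has no match. Because every surname here appears in both dataframes, the result is the same as the inner join.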

merge(producers, movies, by = "surname", all = TRUE)
    surname nationality               title
1 Hitchcock          UK              Psycho
2 Hitchcock          UK  North by Northwest
3  Polanski      Poland           Chinatown
4  Scorsese          US         Taxi Driver
5 Spielberg          US             Super 8
6 Spielberg          US Catch Me If You Can
7 Tarantino          US      Reservoir Dogs

Left Join

The left join keeps all the rows of the first data frame and adds the matching values from the second; rows without a match get NA. Here every producer has at least one movie, so no NA values appear.

merge(producers, movies, by = "surname", all.x = TRUE)
    surname nationality               title
1 Hitchcock          UK              Psycho
2 Hitchcock          UK  North by Northwest
3  Polanski      Poland           Chinatown
4  Scorsese          US         Taxi Driver
5 Spielberg          US             Super 8
6 Spielberg          US Catch Me If You Can
7 Tarantino          US      Reservoir Dogs

Right Join

The right join is the opposite of the left join: it keeps all the rows of the second data frame and adds the matching values from the first.

merge(producers, movies, by = "surname", all.y = TRUE)
    surname nationality               title
1 Hitchcock          UK              Psycho
2 Hitchcock          UK  North by Northwest
3  Polanski      Poland           Chinatown
4  Scorsese          US         Taxi Driver
5 Spielberg          US             Super 8
6 Spielberg          US Catch Me If You Can
7 Tarantino          US      Reservoir Dogs
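In the workshop data every surname appears in both data frames, so the four joins return the same rows. The sketch below uses two small hypothetical data frames, each with one unmatched surname, to show where the join types actually differ.

```r
# Hypothetical data: "Kubrick" has no movie listed, "Nolan" has no producer entry
producers_j <- data.frame(
  surname     = c("Spielberg", "Hitchcock", "Kubrick"),
  nationality = c("US", "UK", "US"),
  stringsAsFactors = FALSE
)
movies_j <- data.frame(
  surname = c("Spielberg", "Hitchcock", "Nolan"),
  title   = c("Super 8", "Psycho", "Inception"),
  stringsAsFactors = FALSE
)

# Inner join: only surnames found in both (2 rows)
inner_j <- merge(producers_j, movies_j, by = "surname")

# Full join: every surname from either side (4 rows, NA where there is no match)
full_j <- merge(producers_j, movies_j, by = "surname", all = TRUE)

# Left join: all producers (3 rows, Kubrick's title is NA)
left_j <- merge(producers_j, movies_j, by = "surname", all.x = TRUE)

# Right join: all movies (3 rows, Nolan's nationality is NA)
right_j <- merge(producers_j, movies_j, by = "surname", all.y = TRUE)

full_j
```

Comparing nrow() across the four results is a quick way to check which join you actually ran.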

Filtering and Wrangling

Sorting

mtcars <- mtcars[order(mtcars$mpg), ]
mtcars
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

Filtering

mtcars4 <- subset(mtcars, carb==4)
mtcars4
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

Pivoting

wideMtcars <- mtcars %>% pivot_wider(names_from = gear, values_from = mpg)
wideMtcars
# A tibble: 32 × 12
     cyl  disp    hp  drat    wt  qsec    vs    am  carb   `3`   `5`   `4`
   <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
 1     8  472    205  2.93  5.25  18.0     0     0     4  10.4  NA      NA
 2     8  460    215  3     5.42  17.8     0     0     4  10.4  NA      NA
 3     8  350    245  3.73  3.84  15.4     0     0     4  13.3  NA      NA
 4     8  360    245  3.21  3.57  15.8     0     0     4  14.3  NA      NA
 5     8  440    230  3.23  5.34  17.4     0     0     4  14.7  NA      NA
 6     8  301    335  3.54  3.57  14.6     0     1     8  NA    15      NA
 7     8  276.   180  3.07  3.78  18       0     0     3  15.2  NA      NA
 8     8  304    150  3.15  3.44  17.3     0     0     2  15.2  NA      NA
 9     8  318    150  2.76  3.52  16.9     0     0     2  15.5  NA      NA
10     8  351    264  4.22  3.17  14.5     0     1     4  NA    15.8    NA
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows